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Abstract 


Matrix operations such as matrix inversion, eigenvalue decomposition, singular value de¬ 
composition are ubiquitous in real-world applications. Unfortunately, many of these matrix 
operations so time and memory expensive that they are prohibitive when the scale of data 
is large. In real-world applications, since the data themselves are noisy, machine-precision 
matrix operations are not necessary at all, and one can sacrifice a reasonable amount of 
accuracy for computational efficiency. 

In recent years, a bunch of randomized algorithms have been devised to make matrix 
computations more scalable. Mahoney [16] and Woodruff [34] have written excellent but 
very technical reviews of the randomized algorithms. Differently, the focus of this paper is 
on intuition, algorithm derivation, and implementation. This paper should be accessible to 
people with knowledge in elementary matrix algebra but unfamiliar with randomized matrix 
computations. The algorithms introduced in this paper are all summarized in a user-friendly 
way, and they can be implemented in lines of MATLAB code. The readers can easily follow 
the implementations even if they do not understand the maths and algorithms. 

Keywords: matrix computation, randomized algorithms, matrix sketching, random projec¬ 
tion, random selection, least squares regression, randomized SVD, matrix inversion, eigen¬ 
value decomposition, kernel approximation, the Nystrom method. 
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Chapter 1 
Introduction 


Matrix computation plays a key role in modern data science. However, matrix computations 
such as matrix inversion, eigenvalue decomposition, SVD, etc, are very time and memory 
expensive, which limits their scalability and applications. To make large-scale matrix com¬ 
putation possible, randomized matrix approximation techniques have been proposed and 
widely applied. Especially in the past decade, remarkable progresses in randomized numer¬ 
ical linear algebra has been made, and now large-scale matrix computations are no longer 
impossible tasks. 

This paper reviews the most recent progresses of randomized matrix computation. The 
papers written by Mahoney [16] and Woodruff [34] provide comprehensive and rigorous 
reviews of the randomized matrix computation algorithms. However, their focus are on the 
theoretical properties and error analysis techniques, and readers unfamiliar with randomized 
numerical linear algebra can have difficulty when implementing their algorithms. 

Differently, the focus of this paper is on intuitions and implementations, and the target 
readers are those who are familiar with basic matrix algebra but has little knowledge in 
randomized matrix computations. All the algorithms in this paper are described in a user- 
friend way. This paper also provides MATLAB implementations of the important algorithms. 
MATLAB code is easy to understand^, easy to debug, and easy to translate to other lan¬ 
guages. The users can even directly use the provided MATLAB code without understanding 
it. 

This paper covers the following topics: 

• Chapter 2 briefly reviews some matrix algebra preliminaries. This chapter can be 
skipped if the reader is familiar with matrix algebra. 

• Chapter 3 introduces the techniques for generating a sketch of a large-scale matrix. 

• Chapter 4 studies the least squares regression (LSR) problem where n S> d. 

• Chapter 5 studies efficient algorithms for computing the fc-SVD of arbitrary matrices. 

'^If your are unfamiliar with a MATLAB function, you can simply type “help -f functionname” in MAT¬ 
LAB and read the documentation. 
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• Chapter 6 introduces techniques for sketching symmetric positive semi-dehnite (SPSD) 
matrices. The applications includes spectral clustering, kernel methods (e.g. Gaussian 
process regression and kernel PGA), and second-order optimization (e.g. Newton’s 
method). 
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Chapter 2 

Elementary Matrix Algebra 


This chapter dehnes the matrix notation and goes through the very basics of matrix decom¬ 
positions. Particularly, the singular value decomposition (SVD), the QR decomposition, and 
the Moore-Penrose inverse are used throughout this paper. 


2.1 Notation 


Let A = [ttij] be a matrix, a = [a*] be a column vector, and a be a scalar. The i-th row 
and j-th column of A are denoted by a*, and a^-, respectively. When there is no ambiguity, 
either column or row can be written as a^. Let be the n x n identity matrix, that is, the 
diagonal entries are ones and off-diagonal entries are zeros. The column space (the space 
spanned by the columns) of A is the set of all possible linear combinations of its column 
vectors. Let [n] be the set {1, 2, • • • ,n}. Let nnz(A) be the number of nonzero entries of A. 

The squared vector £2 norm is dehned by 


a 


2 

2 



i 


The squared matrix Frobenius norm is dehned by 


iaiif = 




and the matrix spectral norm is dehned by 


lAlU = max 


lAxI 


^^0 || X ||2 


2.2 Matrix Decompositions 

QR decomposition. Let A be an m x n matrix with m > n. The QR decomposition of A 
is 

A = 

mxn nxn 
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The matrix Qa has orthonormal columns, that is, QIQa = i„. The matrix Ra is upper 
triangular, that is, for all i < j, the {i,j)-th entry of Ra is zero. 

SVD. Let A be an m X n matrix and p = rank(A). The condensed singular value 
decomposition (SVD) of A is 

p 

= Ua Sa Va = 

mxn mxp pxp pxn 


Here (Ta,i > ■ ■ ■ > crA,p > 0 are the singular values, ua,i, ■ ■ ■ , ua,p € are the left singular 
vectors, and va,i, • ■ ■ , va,p G M"' are the right singular vectors. Unless otherwise specihed, 
“SVD” refers to the condensed SVD. 

fc-SVD. In applications such as the principal component analysis (PCA), latent semantic 
indexing (LSI), word2vec, spectral clustering, we are only interested in the top k (<^ m,n) 
singular values and singular vectors. The rank k truncated SVD (/c-SVD) is denoted by 

k 

Afc := crA,iUA,i^A,i 

i=l 

Here Ua./c consists of the hrst k singular vectors of Ua, and Sa,^ and Vv,fc are analogously 
dehned. Among all the mxn rank k matrices, A^ is the closest approximation to A in that 

Afc = argmin IIA — X||^ = argmin || A — X|| 2 , s.t. rank(X) </c. 

X X 

Eigenvalue decomposition. The eigenvalue decomposition of an n x n symmetric 
matrix A is dehned by 



mxk kxk kxn 


A = UaAaUa = >^A,i^A,i^A,i- 

i=l 

Here Aa,i > • • • > ^A,n are the eigenvalues of A, and ua,i, • ■ ■ , UA,n G 1^" are the correspond¬ 
ing eigenvectors. A symmetric matrix A is called symmetric positive semidehnite (SPSD) 
if and only if all the eigenvalues are nonnegative. If A is SPSD, its SVD and eigenvalue 
decomposition are identical. 


2.3 Matrix (Pseudo) Inverse and Orthogonal Projector 

For an n X n square matrix A, the matrix inverse exists if A is non-singular (rank(A) = n). 
Let A“^ be the inverse of A. Then AA~^ = A^^A = I„. 

Only square and full rank matrices have inverse. For the general rectangular matrices or 
rank dehcient matrices, matrix pseudo-inverse is used as a generalization of matrix inverse. 
The book [1] offers a comprehensive study of the pseudo-inverses. 
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The Moore-Penrose inverse is the most widely used pseudo-inverse, which is dehned by 

A* := 

Let A be any m x n and rank p matrix. Then 

AAt = = Ua^, 

=Ip mxp pxm 

which is a orthogonal projector. It is because for any matrix B, the matrix AA^B = UaUaB 
is the projection of B onto the column space of A. 

2.4 Time and Memory Costs 

The time complexities of the matrix operations are listed in the following. 

• Multiplying an m x n matrix A by an n x p matrix B; 0{mnp) float point operations 
(flops) in general, and 0{p ■ nnz(A)) if A is sparse. Here nnz(A) is the number of 
nonzero entries of A. 

• QR decomposition, SVD, or Moore-Penrose inverse of an m x n matrix [m > n): 
Oimin?) flops. 

• k-SVD of an m X n matrix: 0{nmk) flops (assuming that the spectral gap and the 
logarithm of error tolerance are constant) 

• Matrix inversion or full eigenvalue decomposition of an n x n matrix; 0{n‘^) flops 

• /c-eigenvalue decomposition of an n x n matrix: 0{'n?k) flops. 

Pass-efRcient means that the algorithm goes constant passes through the data. For 
example, the Frobenius norm of a matrix can be computed pass-efficiently, because each 
entry is visited only once. In comparison, the spectral norm cannot be computed pass- 
efficiently, because the algorithm goes at least log ^ passes through the matrix, which is not 
constant. Here e indicates the desired precision. 

Memory cost. If an algorithm scans a matrix for constant passes, the matrix can be 
placed in large volume disks, so the memory cost is not a bottleneck. However, if an algorithm 
goes through a matrix for many passes (not constant passes), the matrix should be placed 
in memory, otherwise the swaps between memory and disk would be highly expensive. In 
this paper, memory cost means the number of entries frequently visited by the algorithm. 
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Chapter 3 

Matrix Sketching 


Let A G be the given matrix, S G be a sketching matrix, e.g. random projection 

or column selection matrix, and C = AS G be a sketch of A. The size of C is much 

smaller than A, but C preserves some important properties of A. 

3.1 Theoretical Properties 

The sketching matrix is useful if it has either or both of the following properties. The two 
properties are important, and the readers should try to understand them. 

Property 3.1 (Subspace Embedding). For a fixed m x n (m n) matrix A and all m- 
dimension vector y, the inequality 

1 Ily^ASIIi 
7 - Ily^Alli - ^ 

holds with high probability. Here S G M"^® (s n) is a certain sketching matrix. 

The subspace embedding property can be intuitively understood in the following way. 
For all n dimensional vectors x in the row space of A (a rank m subspace within the 

length of vector x does not change much after sketching: ||x ||2 ~ ||xS|||. This property can 
be applied to speedup the £2 regression problems. 

Property 3.2 (Low-Rank Approximation). Let A be any mxn matrix and k be any positive 
integer far smaller than m and n. Let C = AS G M”^^® where S G M""^® is a certain sketching 
matrix and s > k. The Frobenius norm error bouncP 

\\A-CC^A\\l < r]\\A-Ak\\l 

holds with high probability for some > 1. 

^Thus there always exists an m dimensional vector y such that x can be expressed as x = y^A. 
^Spectral norm bounds should be more interesting. However, spectral norm error is difficult to analyze, 
and existing spectral norm bounds are “weak” for their factors y are far greater than 1. 


9 




The following error bound is stronger and more interesting: 

min ||A-CX||^ < r]\\A-Ak\\p. 

rank(X)<fc 

It is stronger because ||A — CC'I’A|||, < mini.ank(x)<fc ||A — CX|||,. 

Intuitively speaking, the low-rank approximation property means that the columns of A^ 
are almost in the column space of C = AP. The low-rank approximation property enables 
us to solve fc-SVD more efficiently (for k < s). Later on we will see that computing the 
fc-SVD of CClA is less expensive than the fc-SVD of A. 

The two properties can be verified by a few lines of MATLAB code. The readers are 
encouraged to have a try. With a proper sketching method and a relatively large s, both 7 
and r] should be near one. 

3.2 Random Projection 

The section presents three matrix sketching techniques: Gaussian projection, subsampled 
randomized Hadamard transform (SRHT), and count sketch. Gaussian projection and SRHT 
can be combined with count sketch. 

3.2.1 Gaussian Projection 

The n X s Gaussian random projection matrix S is a matrix is formed by S = :^G, where 
each entry of G is sampled i.i.d. from A/'(0,1). The Gaussian projection is also well knows 
as the Johnson-Lindenstrauss transform due to the seminal work [15]. Gaussian projection 
can be implemented in four lines of MATLAB code. 


1 function [C] = GaussianProjection (A, s) 

2 n = size (A, 2 ); 

38= randn(n, s) / sqrt(s); 

4 C = A * S; 


Gaussian projection has the following properties: 

• Time cost: 0{mns) 

• Theoretical guarantees 

1. When s = 0{m/e‘^), the subspace embedding property with 7 = 1 -|- e holds with 
high probability. 

2. When s = - - 1 - 1, the low-rank approximation property with rj = 1 + e holds in 
expectation [3]. 

• Advantages 
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1. Easy to implement: four lines of MATLAB code 

2 . C is a very high quality sketch of A 

• Disadvantages: 

1. High time complexity to perform matrix multiplication 

2 . Sparsity is destroyed: C is dense even if A is sparse 

3.2.2 Subsampled Randomized Hadamard Transform (SRHT) 

The Subsampled Randomized Hadamard Transform (SRHT) matrix is dehned by S = 
^!=DH„P, where 

• D e l^nxn jg diagonal matrix with diagonal entries sampled uniformly from {+ 1 , — 1 }; 

• G l^nxn jg (jggned recursively by 


Hn = 

H„/2 

H„/2 

and 

H 2 = 

■ +1 +1 ■ 


H„/2 

1 

3 

to 



1 - 

+ 

1 

1_ 


For all y G M", the matrix vector product can be performed in 0{n\ogn) time 

by the fast Walsh-Hadamard transform algorithm in a divide-and-conquer fashion; 

• P G samples s from the n columns. 

SRHT can be implemented in nine lines of MATLAB code below. Notice that this imple¬ 
mentation of SRHT is has 0{mN\ogN) [N > u is a power of two) time complexity, which 
is not efficient. 


1 function [C] = srht (A, s) 

2 n = size (A, 2 ); 

3 sgn = randi( 2 , [ 1 , n]) * 2 — 3 ; % one half are +1 and the rest are —1 
4 A = bsxfun (©times , A, sgn); % flip the signs of each column w.p. 50 % 

5 n = 2"(ceil(log2(n))); 

6 C = (fwht(A’ , n)) % fast Walsh—Hadarmard transform 

7 idx = sort ( randsample (n , s)); 

8 C = C(: , idx) ; % subsampling 
9 C=C * (n / sqrt(s)); 


The SRHT matrix has the following properties: 

• Time complexity: the matrix product AS can be performed in 0{mn log s) time, which 
makes SRHT more efficient than Gaussian projection. (Unfortunately, the MATLAB 
code above does not have such low time complexity.) 

• Theoretical property: when s = 0{e~‘^{m -|- logn) logm), SRHT satishes the subspace 
embedding property with 7 = 1 -|- e holds with probability 0.99 [34, Theorem 7]. 
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Example: 

• Matrix size 9x15 

• Sketch size s = 3 


(a) Hash each column with a value uniformly sampled from [s] = {1,2,3}. 



11111 11 




-1 -1 -1 +1 -1 -1 



-1 +1 


sum 


sum 


sum 


12 3 



(b) Flip the sign of each column with probability 50%, and then sum up columns with the same 
hash value. 


Figure 3.1: Count sketch in the map-reduce fashion. 

3.2.3 Count Sketch 

Count sketch stems from the data stream literature [4; 26]. It was applied to speedup matrix 
computation by [6; 21]. We describe in the following the count sketch for matrix data. 

There are different ways to implementing count sketch. This paper describe two quite 
different ways and refer to them as “map-reduce fashion” and “streaming fashion”. Of 
course, the two are equivalent. 

• The map-reduce fashion has three steps. First, hash each column with a discrete value 
uniformly sampled from [s]. Second, flip the sign of each column with probability 50%. 
Third, sum up columns with the same hash value. This procedure is illustrated in 
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Algorithm 3.1 Count Sketch in the Streaming Fashion. 

1: input: A G 

2 : Initialize C to be an m x s all-zero matrix; 

3: for i = 1 to n do 

4: sample I from the set [s] uniformly at random; 

5: sample g from the set —1} uniformly at random; 

6 : update the l-th column of C by c-j i — ca + gai,i] 

7: end for 
8: return C G 


Figure 3.1. As its name suggests, this approach naturally hts the map-reduce systems. 

• The streaming fashion has two steps. First, initialize C to be the m x s all-zero 
matrix. Second, for each column of A, flip its sign with probability 50%, and add it 
to a uniformly selected column of C. It is described in Algorithm 3.1 an illustrated 
in Figure 3.2. It can be implemented in 9 lines of MATLAB code as below. The 
streaming fashion implementation keeps the sketch C in memory and scans the data 
A in only one pass. If A does not £t in memory, this approach is better than the map- 
reduce fashion for it scans the columns sequentially. If A is sparse matrix, randomly 
accessing the entries may not be efficient, and thus it is better to accessing the column 
sequentially. 


1 
2 

3 

4 

5 

6 

7 

8 
9 

The readers may have noticed that count sketch does not explicitly form the sketching 
matrix S. In fact, S is such a matrix that each row has only one nonzero entry. In the 
example of Figure 3.1, the matrix can be explicitly expressed as 


fun 

ction [C] = CountSketch 

(A, 

s) % 

the 

streaming fash 

ion 

[m, 

n] = size (A) ; 








sgn 

= randi (2 , [1 , 

n] 

* 2 

— 

3; % one 

half are -|-1 and 

the rest are —1 

A = 

bsxfun (©times , 

A, 

sgn) 

; % 

flip 

the 

signs of each 

column w.p. 50% 

11 = 

= randsample (s , 

n, 

true 

) ; % sam 

pie 

n items from [s 

] with replacement 

C = 

zeros (m, s); % 

i n 

i t i a 1 

i z e 

C 




for 

j = 1: n 









C(: , ll(j)) = C(: 

11( 

j)) 

+ A( 

. j 

); 


end 












0 0 1 0 1 -1 1 -1 -1 1 0 0 0 0 0 

-1 0 0 -1 0 0 0 0 0 0 0 -1 1 -1 -1 

0 -1 000000 001000 0 


Count sketch has the following properties: 

• Time cost: (9(nnz(A)) 

• Memory cost: 0{ms). When A does not £t in memory, the algorithm keeps only C in 
memory and goes one pass through the columns of A. 
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Flip the Sign with Randomiy add to 




Figure 3.2: Count sketch in the streaming fashion. 

• Theoretical guarantees 

1. When s = Oijn?/e^), the subspace embedding property holds with 7 = 1 + e with 
high probability. 

2. When s = 0{k/e + k‘^), the low-rank approximation property holds with t] = 1 + e 
relative error with high probability. 

• Advantage: the count sketch is very efficient, especially when A is sparse. 

• Disadvantage: compared with Gaussian projection, the count sketch requires larger s 
to attain the same accuracy. One simple improvement is to combine the count sketch 
with Gaussian projection or SRHT. 

3.2.4 GaussianProjection + CountSketch 

Let Ssc be n X Scs count sketch matrix, be Scs x s Gaussian projection matrix, and 
S = ScsSgp G Then S satisfies the following properties. 

• Time complexity: the matrix product AS can be computed in 

0[ nnz(A) + mScsS ) 

count sketch Gaussian projection 


time. 

• Theoretical properties: 
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1. When Scs = 0{rn?/e^) and s = 0{m/e^), the GaussianProjection+CountSketch 
matrix S satisfy the subspace embedding property with 7 = 1 + e holds with high 
probability. 

2. When Scs = Oik"^ + k/e) and s = 0{k/e), the GaussianProjection+CountSketch 
matrix S satishes the low-rank approximation property with 7 = 1 + e [ 2 , Lemma 
12 ], 

• Advantages: 

1. the size of GaussianProjection+CountSketch is as small as Gaussian projection. 

2. the time complexity is much lower than Gaussian projection when n 3 > m. 

3.3 Column Selection 

This section presents three column selection techniques: uniform sampling, leverage score 
sampling, and local landmark selection. Different from random projection, column selection 
do not have to visit every entry of A, and column selection preserves the sparsity/non¬ 
negativity properties of A. 

3.3.1 Uniform Sampling 

Uniform sampling is the most efficient way to form a sketch. The most important advantage 
is that uniform sampling forms a sketch without seeing the whole data matrix. When applied 
to kernel methods, uniform sampling avoids computing every entry of the kernel matrix. 

The performance of uniform sampling is data-dependent. When the leverage scores (de- 
hned in Section 3.3.2) are uniform, or equivalently, the matrix coherence (namely the greatest 
leverage score) is small, uniform sampling has good performance. The analysis of uniform 
sampling can be found in [12; 13]. 

3.3.2 Leverage Score Sampling 

Before studying leverage score sampling, let’s hrst dehne leverage scores. Let A be an m x n 
matrix, with p = rank(A) < n, and V G be the right singular vectors. The (column) 
leverage scores of A are dehned by 

h ■= llvjJIs, for i = 1 , • • • ,n. 

Leverage score sampling is to select each columns of A with probability proportional to its 
leverage scores. (Sometimes each selected column should be scaled by If can be 

roughly implemented in 8 lines MATLAB code. 
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1 function [C, idx] = LeverageScoreSampling (A, s) 

2 n = size (A, 2); 

3 ~ , V] = svd(A, ’econ’); 

4 leveragescores = sum(VA2, 2); 

5 prob = leveragescores / sum( leveragescores); 

6 idx = randsample (n, s, true, prob); 

7 idx = unique(idx); % eliminate duplicates 

8 C = A(: , idx) ; 


There are a few things to remark: 

• To sample columns according to the leverage scores of where A: m, n, Line 3 can 
be replaced by 

si [' , ' , V] = svds (A, k) ; 


• Theoretical properties 

1. When s = 0{m/e + mlogm), the leverage score sampling satisfies the subspace 
embedding property with 7 = 1 + e holds with high probability. 

2. When s = 0{k/e + k\ogk), the leverage score sampling (according to the leverage 
scores of A^) satisfies the low-rank approximation property with rj = 1 + e. 

• Computing the leverage scores is as expensive as computing SVD, so leverage score 
sampling is not a practical way to sketch the matrix A itself. 

• When the leverage scores are near uniform, there is little difference between uniform 
sampling and leverage score sampling. 


3.3.3 Local Landmark Selection 

Local landmark selection is a very effective heuristic for hnding representative columns. 
Zhang and Kwok [36] proposed to set k = s and run fc-means or /c-centroids clustering 
algorithm to cluster the columns of A to s class, and use the s centroids as the sketch of A. 
This heuristic works very well in practice, though it has little theoretical guarantee. 

There are several tricks to make the local landmark selection more efficient. 

• One can simply solve fc-centroids clustering approximately rather than accurately. For 
example, it is unnecessary to wait for /c-centroids clustering to converge; running k- 
centroids for a few iterations suffices. 

• When n is large, one can uniformly sample a subset of the data, e.g. max{0.2n, 20s} 
data points, and perform local landmark selection on this smaller dataset. 
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• In supervised learning problems, each datum a* is associated with a label Hi. We can 
partition the data to g groups according to the labels and run /c-centroids clustering 
independently on the data in each group. In this way, s = gk data points are selected 
as a sketch of A. 
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Chapter 4 
Regression 


Let A be an n X (i (n > d) matrix whose rows correspond to data and colnmns correspond 
to featnres, and let b G M” contain the response/label of each datum. The least squares 
regression (LSR) 

min IIAx — b ||2 (4.1) 

X 

is a ubiquitous problem in statistics, computer science, economics, etc. When n ^ d, LSR 
can be efficiently solved using randomized algorithms. 


4.1 Standard Solutions 

The least squares regression (LSR) problem (4.1) has closed form solution 

X* = A^b. 

The Moore-Penrose inverse can be computed by SVD which costs 0{nd?) time. 

LSR can also be solved by numerical algorithms such as the conjugate gradient (CG) 
algorithm, and machine-precision can be attained in a reasonable number of iterations. Let 
k(A) := be the condition number of A. The convergence of CG depends on k(A): 

l|A(x<‘l-x»)||i / K(A)-i y 

IIA(x(o) — x*)||2 ~ \k(A)-|-1/ ’ 

where x^*^ is the model in the t-th iteration of CG. The per-iteration time cost of CG is 
(P(nnz(A)). To attain ||A(x*^*^ — x*)||| < e, the number of iteration is roughly 

log —h log(InitialError) j- - -. 

Since the time cost of GG heavily depends on the unknown condition number k(A), GG can 
be very slow if A is ill-conditioned. 
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4.2 Inexact Solution 


Any sketching matrix S G can be used to solve LSR approximately as long as it satisfies 
the subspace embedding property. We consider the following LSR problem: 

k = min II (S^A)x-S^b||2, (4.2) 

sxd 


which can be solved in 0{sd?) time. 

If S is a Gaussian projection matrix, SRHT matrix, count sketch, or leverage score 
sampling matrix, and s = poly((i/e) for any error parameter e G (0,1], then 

||Ax —b ||2 < (1 + e)^ min IIAx — b ||2 

X 


is guaranteed. 


4.2.1 Implementation 

If S is count sketch matrix, the inexact LSR algorithm can be implemented in 5 lines of 
MATLAB code. Here CountSketch is a MATLAB function described in Section 3.2.3. The 
total time cost is (9(nnz(A) + poly((i/e)) and memory cost is (9(poly(d/e)), which are lower 
than the cost of exact LSR when d <^n. 


1 function [xtilde] = InexactLSR (A, b, s) 

2 d = size (A, 2); 

3 sketch = (CountSketch ([A, b]’, s))’; 

4 Asketch = sketch (:, l:d); % Asketch = S’ * A 

5 bsketch = sketch (:, end ); % bsketch = S’ * b 

6 xtilde = Asketch \ bsketch; 


There are a few things to remark: 

• The inexact LSR is useful only when n = fl{d/e + d?). 

• The size of sketch s is a polynomial function of e~^ rather than logarithm of e“^, thus 
the algorithm cannot attain high precision. 


4.2.2 Theoretical Explanation 

By the subspace embedding property, it can be easily shown that x is a good solution. Let 
D = [A, b] G and z = [x; —1] G Then 

Ax — b = Dz and S'^Ax — S^b = S^Dz, 
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and the subspace embedding property indicates -||Dz ||2 < ||S^Dz ||2 < r 7 ||Dz||| for all z. 
Thus 

-||Ax-b ||2 < ||S^(Ax-b )||2 and ||S^(Ax* - b )||2 < r^HAx* - b ||2 
V 

The optimality of x indicates ||S^(Ax — b )||2 < ||S^(Ax* — b)|| 2 , and thus 

-||Ax-b ||2 < ||S^(Ax-b )||2 < ||S^(Ax* - b )||2 < r/||Ax* - b|| 2 . 

r] 

^||Ax-b||2 < ?7^||Ax* - b||2. 

Therefore, as long as S satishes the subspace embedding property, the approximate solution 
to LSR is nearly as good as the optimal solution (in terms of objective function value). 


4.3 Machine-Precision Solution 

Randomized algorithms can also be applied to find machine-precision solution to LSR, and 
the time complexity is lower than the standard solutions. The state-of-the-art algorithm [18] 
is based on very similar idea described in this section. 


4.3.1 Basic Idea: Preconditioning 


We have discussed previously that the time cost of the conjugate gradient (CG) algorithm 
is roughly 


k(A) — 1 


log —h log(InitialError)^nnz(A), 


which dependents on the condition number of A. To make CG efficient, one can find a. dx d 
preconditioning matrix T such that k(AT) is small, solve 


z* = argmin II (AT)z — b ||2 

Z 


(4.3) 


by CG, and let x* = Tz*. In this way, the time cost of CG is roughly 


k(AT) 


log - + 


log(InitialError)^nnz(A). 


If k(AT) is a small constant, e.g. fi;(AT) = 2, then (4.3) can be very efficiently solved by 
GG. 

Now let’s consider how to find the preconditioning matrix T. Let A = QaR-a be 
the QR decomposition. Obviously T = is a perfect preconditioning matrix because 
k(AR[^^) = ^(Qa) = 1- Unfortunately, the preconditioning matrix T = R[^^ is not a 
practical choice because computing the QR decomposition is as expensive as solving LSR. 

Woodruff [34] proposed to use sketching to find Ra approximately in (!l(nnz(A)-|-poly((i)) 
time. Let S G be a sketching matrix and form Y = S^A. Let Y = QyRy be the QR 
decomposition of Y. Theory shows that the sketch size s = O^d?) suffices for k(ARy^) < 2 
holding with high probability. Thus Ry^ G is a good preconditioning matrix. 
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Algorithm 4.1 Machine-Precision Solution to LSR. 

1: input: A G b G M"", and step size 6. 

2: Draw a sketching matrix S G where s = 0{(P)\ 

3: Form the sketch Y = S'^A G 

4: Compute the QR decomposition Y = QyRy; 

5: Compute the preconditioning matrix T = Ry^; 

6: Compute the initial solution = (S^AT)'^(S^b) = QY(S^b); 
7: for t = 1, - ■ ■ , 0{log€~^) do 
8: r*^*) = b — ATz(*“^) ; // the residual 

9: z^^'> = + 9T^II gradient descent 

10: end for 

11: return x* = G 


4.3.2 Algorithm Description 

The algorithm is described in Algorithm 4.1. We hrst form a sketch Y = S^A G and 
compute its QR decomposition Y = QyRy- We can use this QR decomposition to hnd the 
initial solution and the preconditioning matrix T = Ry^. If we set s = 0{(P), the initial 
solution is only constant times worse than the optimal in terms of objective function value. 
Theory also ensures that the condition number k(AT) < 2. With the good initialization 
and good condition number, the vanilla gradient descent^ or CG takes only C>(loge“^) steps 
to attain 1 -|- e solution. Notice that Lines 8 and 9 in the algorithm should be cautiously 
implemented. Do not compute the matrix product AT because it would take (P(nnz(A)(i) 
time! 


4.4 Extension: CX-Type Regression 

Given any matrix A G GX decomposition considers decomposing A into A CX*, 

where C G is a sketch of A and X* G is computed by 

X* = argmin ||A — CXlIp = C^A. 

It takes 0{mnc) time to compute X*. If c m, this problem can be solved more efficiently 
by sketching. Specihcally, we can draw a sketching matrix S G and compute the 

approximate solution 

X = argmin II = (S^C)t(S^A) 

^ sxc cxn sxn 

If S is a count sketch matrix, we set s = 0{cle -t- c^); if S samples columns according to the 
row leverage scores of C, we set s = 0{cle A clogc). It holds with high probability that 

||A-CX||^^ < (1 + e) min||A-CX||t. 

^ Since AT is well conditioned, the vanilla gradient descent and CG has little difference. 
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4.5 Extension: CUR-Type Regression 

A more complicated problem has also been considered in the literature [25; 29; 24]: 

X* = argmin||^^^-^||^ (4.4) 

^ nxc cxr rxn mxn 


where c,r m, n. The solution is: 

X* = C^AR^ 

which cost 0{mn ■ min{c, r}) time. Wang et al. [31] proposed an algorithm to solve (4.4) 
approximately by 

X = argmin||S5 (CXR-A)Sh||| 

X 

where Sc G and Sji G are leverage score sampling matrices. When Sc = c^Jq/e 

and Sr = r^Jq/e (where q = min{m,?7,}), it holds with high probability that 

||CXR- All ^ < (1 + e) min ||CXR — A||^. 

The total time cost is 

0{scSr ■ min{c, r}) = 0{cre~^ ■ min{m, n} ■ min{c, r}) 

time, which is useful when max{m,n} S> c, r. The algorithm can be implemented in 4 lines 
of MATLAB code: 


1 function [Xtilde] = InexactCurTypeRegression (C, R, A, sc, sr) 

2 [~ , idxC] = LeverageScoreSampling (C’ , sc); 

3 [~ , idxR] = LeverageScoreSampling (R, sr); 

4 Xtilde = pinv(C(idxC, :)) * A(idxC , idxR) * pinv(R(:, idxR)); 


Here the function “LeverageScoreSampling” is described in Section 3.3.2. Empirically, setting 
Si = S 2 = 0{di + d 2 ) suffices for high precision. The experiments in [31] indicates that 
uniform sampling performs equally well as leverage score sampling. 
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Chapter 5 

Rank k Singular Value Decomposition 


This chapter considers the fc-SVD of a large scale matrix A G which may not fit in 

memory. 


5.1 Standard Solutions 

The standard solutions to fc-SVD include the power iteration algorithm and the Krylov 
subspace methods. Their time complexities are considered to be 0{mnk), where the O 
notation hides parameters such as the spectral gap and logarithm of error tolerance. Here we 
introduce a simplified version of the block Lanczos method [19]^ which costs time 0{mnkq), 
where q = log ^ is the number of iterations, and the inherent constant depends weakly on the 
spectral gap. The block Lanczos algorithm is described in Algorithm 5.1 can be implemented 
in 18 lines of MATLAB code. 


1 

2 

3 

4 

5 

6 
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11 

12 

13 

14 

15 

16 


function [U, S, V] = BlockLanczos (A, k, q) 
s = 2 * k; % can be tuned 
[m, n] = size (A) ; 

C = A * randn (n, s); 

Krylov = zeros (m, s * q) ; 

Krylov (: , l:s) = C; 
for i = 2: q 

C = A’ * C; 

C = A * C; 

[C, '] = qr(C, 0); % optional 
Krylov (: , (i—l)*s+l: i*s) =C; 

end 

[Q, '] = qr(Krylov , 0) ; 

[Ubar, S, V] = svd(Q’ * A, ’econ’); 

Ubar = Ubar(:, l:k); 

S = S(l:k, l:k); 


^We introduce this algorithm because it is easy to understand. However, as q grows, columns of the 
Krylov matrix gets increasingly linearly dependent, which sometimes leads to instability. Thus there are 
many numerical treatments to strengthen stability (see the numerically stable algorithms in [23]). 
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Algorithm 5.1 /c-SVD by the Block Lanczos Algorithm. 
1: Input: an m X n matrix A and the target rank k. 

2: Set s = k + 0(1) he the over-sampling parameter; 

3: Set q = 0(log 7 ) be the number of iteration; 

4: Draw a n x s sketching matrix S; 

5: C = AS; 

6: SetK= [C, (AA^)C, (AA^)2C, ••• , (AA^)'?-iC]; 

7: QR decomposition: [ Qc , Rc] = 

mxsq mxsq 


i: SVD: = svdiQi^A); 

sqxsq sqxsq nxsq 


9: Retain the top k components of U, S, and V to form sq x k, k x k, n x k matrices; 
10 : U = QtJ e 

11: return UX)V^ Ri A^.. 


i7V = V(:, l:k); 
18 U = Q * Ubar; 


Although the block Lanczos algorithm can attain machine precision, it inevitably goes many 
passes through A, and it is thus slow when A does not £t in memory. 

Facing large-scale data, we must trade off between precision and computational costs. 
We are particularly interested in approximate algorithm that satisfies: 

1. The algorithm goes constant passes through A. Then A can be stored in large volume 
disks, and there are only constant swaps between disk and memory. 

2. The algorithm only keeps a small-scale sketch of A in memory. 

3. The time cost is 0{mnk) or lower. 

5.2 Prototype Randomized /c-SVD Algorithm 

This section describes a randomized algorithm that computes the /c-SVD of A up to 1 -|- e 
Frobenius norm relative error. The algorithm is proposed by [14], and it is described in 
Algorithm 5.2. 

5.2.1 Theoretical Explanation 

If C = AS G is a good sketch of A, the column space of C should roughly contain 

the columns of A^,—this is the low-rank approximation property. If S G is Gaussian 
projection matrix or count sketch and s = 0{k/e), then the low-rank approximation property 

min ||CZ-A||| < (1-b e)||A - Afc||| (5.1) 

rank(Z)</c 
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Algorithm 5.2 Prototype Randomized fc-SVD Algorithm. 
1: Input: an m X n matrix A and the target rank k. 

2: Draw a. n x s sketching matrix S where s = 

3: C = AS; 

4: QR decomposition: 1^0^, Rc] = 

mxs mxs 


5: A:-SVD: = svds{Q^A,k); 

sxk kxk nxk sxn 

6: U = QcU E 
7: return ss A^.. 


holds in expectation. 


5.2.2 Algorithm Derivation 


Let Qc be any orthonormal bases of C. Since the colnmn space of C is the same to the 
colnmn space of Qc, the minimization problem in (5.1) can be eqnivalently converted to 


X* 


argmin II Qc X 


rank(X)<fc 



mxn 


(QgA),. 


(5.2) 


Here the second eqnality is a well known fact. The matrix A^ is well approximated by 
Afc := QcX*, so we need only to hnd the fc-SVD of Aa,: 


A, := Qc. X^ = Qc(QcA)fc = QcUSV^ = 

mxk kxk kxn 


:=USVT’ 


;=U 


It is easy to check that U and V have orthonormal columns and S is a diagonal matrix. 
Notice that the accuracy of the randomized fc-SVD depends only on the quality of the sketch 
matrix C. 


5.2.3 Implementation 

The algorithm is described in Algorithm 5.2 and can be implemented in 5 lines of MATLAB 
code. Here s = 0{^) is the size of the sketch. 

1 function [Utilde, Stilde , Vtilde] = ksvdPrototype (A, k, s) 

2 C = CountSketch (A, s) ; 

3 [Q, R] = qr(C, 0) ; 

4 [Ubar, Stilde , Vtilde] = svds (Q’ * A, k) ; 

5 Utilde = Q * Ubar; 


Empirically, using “svd(Q' * A, 'econ')” followed by discarding the fc + 1 to s components 
should be faster than the “svds” function in Line 4. 

The algorithm has the following properties: 
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1. The algorithm goes 2 passes through A; 

2. The algorithm only keeps an m x (9(^) sketch C in memory; 


3. The time cost is 0{miz{A)k/e). 

5.3 Faster Randomized /c-SVD 

The prototype algorithm spends most of its time on solving (5.2); if (5.2) can be solved more 
efficiently, the randomized fc-SVD can be even faster. The readers may have noticed that 
(5.2) is the least squares regression (LSR) problem discussed in Section 4.4. Yes, we can 
solve (5.2) efficiently by the inexact LSR algorithm presented in the previous section. 

5.3.1 Theoretical Explanation 

Now we draw a m x p GaussianProjection+CountSketch matrix P and solve this problem: 



(5.3) 


To understand this trick, the readers can retrospect the extension of LSR in Section 4.4. Let 



mxpcs PcsXp 


where Pcs = 0{k/e+k‘^) andp = 0{k/e). The subspace embedding property of RSHT+CountSketch 
[6, Theorem 46] implies that 


(l + e)-lQcX-A||| < ||P'^(QcX-A)||| < ||P'^(QcX* - A)||| < (1 + 6)||QcX* - A||| 


^ ||QcX-A||^ < (l + e)lQcX*-A||^ < (1 + e)3||A - Afc||| 


Here the second inequality follows from the optimality of X, and the last inequality follows 
from the low-rank approximation property of the sketch C = AS. Thus, by solving (5.3) we 
get fc-SVD up to 1 -|- 0(e) Frobenius norm relative error. 

5.3.2 Algorithm Derivation 

The faster randomized /c-SVD is described in Algorithm 5.3 and derived in the following. 
The algorithm solves 



(5.4) 


to obtain the rank k matrix X G and approximates A^ by 


Algorithm 5.3 Faster Randomized fc-SVD Algorithm. 


Input: an m X n matrix A and the target rank k. 

Set the parameters as s = 0{^), pcs = log® f + f , and p = f log f ; 
Draw a n X s count sketch matrix S and perform sketching: C = AS; 
Draw an m X pcs count sketch matrix Pcs and an pcs x p matrix Psrht] 
Perform Sketching: D = Pf^^^P^^C e and L = Pj^^^P^,A E 
QR decomposition: [Qd , RdJ = 

pxs 


pxs sxs 


7 : A:-SVD: [ U , S , V ] = svds{Q^L, k); 


sxk kxk nxk 


8 : SVD: = snd(CRi)US); 

nxk kxk kxk 

9 : V 

nxk kxk 

10 : return Ri Au. 


sxk 


Dehne D = P^C, L = P^A, and let QdH-d = D be the QR decomposition. Then (5.4) 
becomes 

X = argmin || —||^ = (QqL)^ . 

rank(X)<fc 

Based on the dehned notation, we decompose A^ Ri CX by 

A nv — nT^t — nr?! TTx-'Ar'r _ 


fc ~ CX = CRij(Q^)L)fc = CRijUSV = USV^V^ = JJ S V' 

~V^ mxfc kxk kxn 


:=USVr 


:=USVr 


5.3.3 Implementation 

The faster randomized fc-SVD is described in Algorithm 5.3 and implemented in 18 lines of 
MATLAB code. 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


function [Utilde, Stilde , Vtilde] = ksvdFaster (A, k, s, pi, p2) 
n = size (A, 2) ; 

C = CountSketch (A, s); 

A= [A, C]; 

A = A’; 

sketch = CountSketch (A, pi); 
clear A % A (m-by—n) will not be used 
sketch = GaussianProjection ( sketch , p2); 
sketch = sketch ’; 

L = sketch (: , 1:n) ; 

D= sketch (:, n+l:end); 

clear sketch % sketch (p2—by—(n+c)) will not be used 
[QD, RD] = qr(D, 0); 
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[Ubar, Sbar , Vbar] = svds (QD’ * L, k) ; 
clear L % L (p2—by—n) will not be used 
C = C * ( pinv (RD) * (Ubar * Sbar) ) ; 
[Utilde, Stilde, Vhat ] = svd (C, ’ econ ’ ) ; 

Vtilde = Vbar * Vhat; 


There are a few things to remark: 

1. The algorithm goes only two passes throngh A. 

2. The algorithm costs time 0{m\z[A) + (m + n)poly(/c/e)). 

3. The parameters should be set as fc < s < p2 < pi <C m, n. 

4. Line 8 can be removed or replaced by other sketching methods. 

5. “A”, “sketch”, and “L” are the most memory expensive variables in the program, but 
fortunately, they are swept only one or two passes. If “A”, “sketch”, and “L” do not 
fit in memory, they should be stored in disk and loaded to memory block-by-block to 
perform computations. 

6. Unless both m and n are large enough, this algorithm may be slower than the prototype 
algorithm. 
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Chapter 6 

SPSD Matrix Sketching 


This chapter considers SPSD matrix K G which can be a kernel matrix, a social 

network graph, a Hessian matrix, or a Fisher information matrix. Onr objective is to find 
a low-rank decomposition K LL^. (Notice that LL^ is always SPSD, no matter what L 
is.) If K is symmetric bnt not SPSD, it can be approximated by K CZC^ where Z is 
symmetric bnt not necessarily SPSD. 


6.1 Motivations 

This section provides three motivation examples to show why we seek to sketch K by K 

LL^ or K ^ CZC^. 

6.1.1 Forming a Kernel Matrix 

In the kernel approximation problems, we are given 

• an n X d matrix X, whose rows are data points xi, ■ ■ ■ , x„ G M'^, 

• a kernel fnnction, e.g. the Ganssian RBF kernel fnnction defined by 

K(xi,xj) = exp (^ - 

where a > 0 is the kernel width parameter. 

The RBF kernel matrix can be compnted by the following MATLAB code: 


fun 

ction 

[K] = 

rbf (XI, 

X2 

, sigma) 

K = 

XI * 

X2’; 




Xl_ 

row_sq 

= sum 

(XI.‘2, 

2 ) 

/ 2; 

X2_ 

row_sq 

— sum 

(X2.‘2, 

2 ) 

/ 2; 

K = 

bsxfu 

n(©minus, K, 

XI 

_row_sq) ; 

K = 

bsxfu 

n(©minus, K, 

X2 

_row_sq ’) ; 

K = 

K / ( 

sigma" 

2 ); 



K = 

exp(K) ; 
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If Xi and X 2 are respectively rii x d and n 2 x d matrices, then the output of “rbf” is an 
Til X n2 matrix. 

Kernel methods requires forming the n x n kernel matrix K whose the (i, j)-th entry is 
K(xi,Xj). The RBF kernel matrix can be computed by the MATLAB function 

1 K = rbf(X, X, sigma) 
in 0{'n?d) time. 

In presence of millions of data points, it is prohibitive to form such a kernel matrix. 
Fortunately, a sketch of K can be obtained very efficiently. Let S G be a uniform 

column selection matrix^ described in Section 3.3, then C = KS can be obtained in 0{nsd) 
time by the following MATLAB code. 


1 function [C] = rbfSketch(X, sigma, s) 

2 n = size (X, 1); 

3 idx = sort ( randsample (n , s)); 

4 G = rbf(X, X(idx, :), sigma) ; 


6.1.2 Matrix Inversion 

Let K be an n X n kernel matrix, y be an n dimensional vector, and a be a positive constant. 
Kernel methods such as the Gaussian process regression (or the equivalent the kernel ridge 
regression) and the least squares SVM require solving 

(K + Q!l„)w = y 

to obtain w G M”. The exact solution costs 0{n^) time and Cl(n^) memory. 

If we have a rank I approximation K LL^, then w can be approximately obtained in 
0{nl‘^) time and 0{nl) memory. Here we need to apply the Sherman-Morrison-Woodbury 
matrix identity 

(A + BCD)-^ = A"^-A-^B(C-^ + DA-iB)-iDA-^ 

We expand (LL^ + al„)“^ by the above identity and obtain 

(LL^ + aI„)-^ = a-^I„-a-^L(aIi + L^L)-^L^, 

Ixl 


and thus 

w = (K + aI„)“V ~ ~ + L'^L)“^L'^y. 

The matrix inversion problem not only appears in the kernel methods, but also in the 
second order optimization problems. Newton’s method and the so-called natural gradient 

Mhe local landmark selection is sometimes a better choice. Do not use random projections, because they 
inevitably visit every entry of K. 
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method require computing H^^g, where g is the gradient and H is the Hessian matrix or 
the Fisher information matrix. Since low-rank matrices are not invertible, the naive low- 
rank approximation H CZC^ does not work. To make matrix inversion possible, one 
can use the spectral shifting trick of [30]: £x a small constant a > 0, form the low-rank 
approximation H — al„ ~ CZC^, and compute H'^g (CZC^ -|- a;I„)“^g. Besides the 
low-rank approximation approach, one can approximate H by a block diagonal matrix or 
even its diagonal, because it is easy to invert a diagonal matrix or a block diagonal matrix. 

6.1.3 Eigenvalue Decomposition 

With the low-rank decomposition K LL^ at hand, we hrst approximately decompose K 
by 

K ^ LL^ = (UlSlV^])(UlSlV?;)^ = UlS^U^, 

and then discard the A: -|- 1 to / components in Ul and El. Here L = UlSlVl is the SVD 
of L, which can be obtained in 0{nt^) time and 0{nl) memory. In this way, the rank k 
{k < rank(L)) eigenvalue decomposition is approximately computed. 


6.2 Prototype Algorithm 

From now on, we will consider how to hnd the low-rank approximation K cs LL^. As usual, 
the simplest approach is to form a sketch C = KS G and solve 

X* = min||K-CXC^||| = C'fK(C'^)^ or Z* = min ||K - QcZQcH?. = QcKQc, 
X z 

( 6 . 1 ) 

where Qc is the orthonormal bases of C computed by SVD or QR decomposition. It is 
obvious that CX*C = QcZ*Qq. In this way, a rank c approximation to K is obtained. 
This approach is hrst studied by [14]. Wang et al. [30] showed that if C contains s = 0{k/e) 
columns of K chosen by adaptive sampling, the error bound 

E||K-QcZ*Q^||^ < (l + e)||K-Kfc||| 

is guaranteed. Other sketching methods can also be applied, although currently they do not 
have 1 -|- e error bound. In the following we implement the prototype algorithm (with the 
count sketch) in 5 lines of MATLAB code. Since the algorithm goes only two passes through 
K, when K does not £t in memory, we can store K in the disk and keep one block of K in 
memory at a time. In this way, 0{ns) memory is enough. 


function [QC, Z] = 
n = size (K, 2) ; 

C = CountSketch (K, 
[QC, '] = qr(C, 0); 
Z = QC’ * K * QC; 


spsdPrototype(K, s) 

s); 
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Algorithm 6.1 Faster SPSD Matrix Sketching. 


Input: an n X n matrix K and integers s and p {s <p <^n) 
Draw a column selection matrix S G 
Perform sketching; C = AS; 

QR decomposition: [Qc,Rc] = ^^(C); 

Draw a column selection matrix P G 
Compute Z = (P^Qc)'^(P^KP)(QgP)t; 
return Q^ZQ^ A. 


Despite its simplicity, the algorithm has several drawbacks. 

• The time cost of this algorithm is 0{ns^ + nnz(K)s), which can be qnadratic in n. 

• The algorithm must visit every entry of K, which can be a serious drawback when 
applied to kernel methods. It is because computing the kernel matrix K costs 0{n^d) 
time, where d is the dimension of the data points. 

Therefore, we are interested in computing a low-rank approximation in linear time (w.r.t. n) 
and avoiding visiting every entry of K. 

6.3 Faster SPSD Matrix Sketching 

The readers may have noticed that (6.1) is the problem studied in Section 4.5. We can thus 
draw a column selection matrix P G and approximately solve (6.1) by 

Z = mm||P^(K-QcZQc)P||| = (P^^^Qc^^- (6-2) 

sxp pxp 

Then we can approximate K by QcZQ|). We describe the faster SPSD matrix sketching in 
Algorithm 6.1. 

There are a few things to remark. 

• Since we are trying to avoid computing every entry of K, we should use uniform 
sampling or local landmark selection to form C = KS. 

• Let P G be a leverage score sampling matrix according to the columns of C^. 

That is, it samples the Tth column with probability proportional to g*, where g* is the 
squared i.2 norm of the Tth row of Qc (for z = 1 to n). When p = the 

following error bounds holds with high probability [31] 

||K-QcZQc||| < (1 + e)min ||K - QcZQcll^. 

z 

• Let S be a uniform sampling matrix and P be a leverage score sampling matrix. The 
algorithm visits only ns+ p^ = 0{n) entries of K. The overall time and memory costs 
are linear in n. 
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• Assume S is a column selection matrix. Let the sketch C = KS contains the columns 
of K indexed by 5 C [n], and the columns selected by P are indexed by P C [n\. 
Empirically, enforcing S <ZV signihcantly improves the approximation quality. 


• Empirically, letting p be several times larger than s, e.g. p = 4s, is sufficient for a high 
quality. 


The algorithm can be implemented in 12 lines of MATLAB code. 


1 function [QC, Z] = spsdFaster (K, s) 

2 p = 4 * s; % can be tuned 

3 n = size (K, 2); 

4 S = sort (randsample (n, s)); % uniform sampling 

5 C = K(: , S) ; 

6 [QC, '] = qr(C, 0); 

7 q = sum(QC."2, 2); % the sampling probability 

8 q = q / sum(q) ; 

gP = randsample (n, p,true, q) ; % leverage score sampling 

10 P = unique ([P; S]) ; % enforce P to contain S 

11 PQCinv = pinv(QC(P, :) ) ; 

12 Z = PQCinv * K(P, P) * PQCinv’; 


The above implementation assumes that K is a given matrix. In the kernel approximation 
problems, we are only given a n x d matrix X, whose rows are data points, and a kernel 
function, e.g. the RBF kernel with width parameter a. We should implement the faster 
SPSD sketching algorithm in the following way. 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


function [QC, Z] = spsdFaster (X, sigma, s) 
p = 4 * s; % can be tuned 
n = size (X, 1) ; 

S = sort (randsample (n, s)); % uniform sampling 
C = rbf(X, X(S, :) , sigma) ; 

[QC, '] = qr(C, 0); 

q = sum(QC."2, 2); % the sampling probability 
q = q / sum(q) ; 

P = randsample (n, p,true, q); 

P = unique ([P; S]) ; % enforce P contains S 
PQCinv = pinv(QC(P, :)); 

Ksub = rbf(X(P, :) , X(P, :) , sigma); 

Z = PQCinv * Ksub * PQCinv ’; 


The above implementation avoids computing the whole kernel matrix, and is thus highly 
efficient when applied to kernel methods. 
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6.4 The Nystrom Method 


Let S be an n X s column selection matrix and C = KS G be a sketch of K. Recall the 
model (6.2) proposed in the previous section. It is easy to verify that QcZQq = CXC^, 
where X is dehned by 

X = nnn||P^(K - CXC)P||| = (P^C)'^ (P^KP) (C^P)t. 

sxp pxp 

One can simply set P = S G and let W = S^C = S^KS. Then the solution X 

becomes 

X = (S^C)t(S^KS)(C^S)^ = W+WWt = wb 
The low-rank approximation 

K ^ CW^C^ 

is called the Nystrom method [20; 32]. The Nystrom method is perhaps the most extensively 
used kernel approximation approach in the literature. See Figure 6.1 for the illustration of 
the Nystrom method. 



nXn 




s Xs 



—I— 

sXn 


Figure 6.1: The illustration of the Nystrom method. 

There are a few things to remark: 

• The Nystrom is highly efficient. When applied to speedup kernel methods, the scala¬ 
bility can be as large as n = 10®. 

• The Nystrom method is a rough approximation to K and is well known to be of low 
accuracy. If a moderately high accuracy is required, one had better use the method in 
the previous section. 

• The s X s matrix W is usually ill-conditioned, and thus the Moore-Penrose inverse 
can be numerically instable. (It is because the bottom singular values of W blow up 
during the Moore-Penrose inverse.) A very effective heuristic is to drop the bottom 
singular values of W: set a parameter k < s, e.g. k = [0.8s], and approximate K by 
C(Wfc)tC^. 
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• There are many choices of the sampling matrix S. See [13] for more discussions. 

The Nystrom method can be implemented in 11 lines of MATLAB code. The output of the 
algorithm is L G where LL^ is the Nystrom approximation to K.^ 


1 

function [L] = Nystrom(X, sigma, s) 

2 

k = ceil (0.8 * s); % can be tuned 

3 

n = size (X, 1) ; 

4 

S = sort (randsample (n, s)); % uniform sampling 

5 

C = rbf(X, X(S, :) , sigma) ; % C = K(: , S) 

6 

CO 

II 

7 

[UW, SW, ~] = svd(W); 

8 

SW = diag (SW) ; 

9 

SW = 1 ./ sqrt (SW(l:k) ) ; 

10 

UW = bsxfun (©times , UW(: , 1: k) , SW’) ; 

11 

L = C * UW; % K is approximated by L * L’ 


Here we use the RBF kernel function implemented in Section 6.1. Line 8 sets k = [0.8c], 
which can be better tuned to enhance numerical stability. Notice that k should not be set 
too small, otherwise the accuracy would be affected. 


6.5 More Efficient Extensions 

Several SPSD matrix approximation methods has been proposed recently, and they are more 
scalable than the Nystrom method in certain applications. This section briefly describes some 
of these methods. 


6.5.1 Memory Efficient Kernel Approximation (MEKA) 

MEKA [24] exploits the block structure of kernel matrices and is more memory efficient 
than the Nystrbm method. MEKA hrst partitions the data xi, • ■ ■ , x„ into b groups (e.g. by 
inexact fcmeans clustering), accordingly, the kernel matrix K has b x b blocks: 


-1 

I 


■ K[l:] ■ 

. . 

_1 

1 


. . 


Then MEKA approximately computes the top left singular vectors of Kjip • ■ ■ , K[fe:], denote 
U[i], ■ ■ ■ , U[fe], respectively. For each {i,j) G [6] x [6], MEKA Ends a very small-scale matrix 
Z[ij] by solving 

Z[i,i] = argmin||K[ij]-U[i]Z[ij]Uj]||^. 

z 

^Let Wfe = Uw,fcKw,feU^ ^ be the fc-eigenvalue decomposition of W and set L = CUw.fcA:^ ^ S 
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This can be done efficiently using the approach in Section 4.5. Finally, the low-rank approx¬ 
imation is 


K 


U[i] 

0 


0 

U[.] 


[1,1] • ■ 

■ Z[i,b] 

[Ml ■ ■ 

■ '^[b,b] 


U[1] 

0 


0 

U[.] 


-1 T 


= UZUffi 


Since Z and Up],-- - ,U[b] are small-scale matrices, MEKA is thus very memory efficient. 
There are several things to remark: 


• MEKA can be used to speedup Gaussian process regression and least squares SVM. 
However, MEKA can be hardly applied to speedup ^-eigenvalue decomposition, be¬ 
cause it requires the fc-SVD of UZ^/^, which destroys the sparsity and significantly 
increases memory cost. 

• Indiscreet implementation, e.g. the implementation provided by [24], can make MEKA 
numerically unstable, as was reported by [30; 28]. The readers had better to follow the 
stabler implementation in [28]. 


6.5.2 Structured Kernel Interpolation (SKI) 

SKI [33] is a memory efficient extension of the Nystrom method. Let S be a column selection 
matrix, C = KS, and W = S^C = S^KS. The Nystrom method approximates K by 
CW^C^. SKI further approximates each row of C by a convex combination of two rows of 
W and obtain C ~ XW. Notice that each row of X has only two nonzero entries, which 
makes X extremely sparse. In this way, K is approximated by 

K ^ CW^C ^ (XW)Wf(XW)'^ = XWX^. 

Much accuracy is lost in the second approximation, so SKI is much less accurate than the 
Nystrom method. For the same reason as MEKA, there is no point in applying SKI to 
speedup /c-eigenvalue decomposition of K. 


6.6 Extension to Rectangular Matrices: CUR Matrix 
Decomposition 

This section considers the problem of sketching any rectangular matrix A by the CUR 
matrix decomposition [16]. The CUR matrix decomposition is an extension of the previously 
discussed SPSD matrix sketching methods. 

6.6.1 Motivation 

Suppose we are given n training data xi, • • ■ , x„ G m test data x'^^, • • ■ , x.'^ E M'^, and a 
kernel function k{-, •). In their generalization (test) stage, kernel methods such as GPR and 
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KPCA form an m x n matrix K*, where (K*)jj = «:(x',Xj), and apply K* to some vectors 
or matrices. Notice that it takes 0{mnd) time to form K* and 0{mnp) time to mnltiply K* 
by an n X p matrix. If m is as large as n, the generalization stage of snch kernel methods 
can be very expensive. Fortnnately, with the help of the CUR matrix decomposition, the 
generalization stage of GPR or KPCA merely costs time linear in m + n. 

6.6.2 Prototype CUR Decomposition 

Suppose we are given an arbitrary mxn rectangular matrix A, which can be the aforemen¬ 
tioned K*. We sample c columns of A to form C = ASc G sample r rows of A to 

form R = ASr G and compute the intersection matrix U* G by solving 

U* = argminll^-^^^lll = C^AR^ (6.3) 

^ mxn mxc cxr rxn 

The approximation A CU*R is well known as the CUR decomposition [16]. This for¬ 
mulation bears a strong resemblance with the prototype SPSD matrix sketching method in 
( 6 . 1 ). 

The prototype CUR decomposition is not very useful because (1) its time cost is 0{mn ■ 
min{c, r}) and (2) it visits every entry of A. 

6.6.3 Faster CUR Decomposition 

Analogous to the SPSD matrix sketching, we can compute U* approximately and signih- 
cantly more efficiently. Let Pc G and Pr G be some column selection matrices. 

Then we solve this problem in stead of (6.3): 

U = argminllP^APR-PgC^R^III = (P^C)t(PgAPR)(RPR)t. (6.4) 

PcX-Pr PcXC rXpr 

The faster CUR decomposition is very similar to the faster SPSD matrix sketching method 
in Section 6.3. The faster CUR decomposition has the following properties: 

• It visits only me + nr + pePr entries of A, which is linear in m -|- n. This is particularly 
useful when applied to kernel methods, because it avoids forming the whole kernel 
matrix. 

• The overall time and memory costs are linear in m -|- n. 

• If Pc is the leverage score sampling matrix corresponding to the columns of and 
Pr is the leverage score sampling matrix corresponding to the columns of R, then U 
is a very high quality approximation to U* [31]: 

||A-CUR||| < (1+ e)niin||A - CURIII 

holds with high probability. 
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Empirically speaking, setting Pc and Pr be nniform sampling matrices works nearly as 
well as leverage score sampling matrices, and setting Pc = Pr = 0{c + r) snfiices for a high 
approximation quality. If A is a full-observed matrix, the CUR matrix decomposition can 
be computed by the following MATLAB code. 


1 

function [C, U, R] = curFaster(A, c, r) 

2 

pc = 2 * (r -1- c) ; % can be tuned 

3 

pr = 2 * (r -1- c) ; % can be tuned 

4 

[m, n] = size (A) ; 

5 

SC = sort ( randsample (n , c)); 

6 

SR = sort ( randsample (m, r)); 

7 

C = A(: , SC) ; 

8 

R = A(SR, :) ; 

9 

PC = sort ( randsample (m, pc)); 

10 

PR = sort ( randsample (n , pr)); 

11 

PC = unique ([PC; SR]) ; % enforce PC to contain SR 

12 

PR = unique ([PR; SC]); % enforce PR to contain SC 

13 

U = pinv(C(PC, :)) * A(PC, PR) * pinv(R(:, PR)); 


Let’s consider the kernel approximation problem in Section 6.6.1. Let Xtrain G be 
the training data and Xtest ^ be the test data. We use the RBF kernel with kernel 

width parameter a. The m x n matrix K* can be approximated by K* = CUR, which is 
the output of the following MATLAB procedure. 


1 

2 

3 

4 

5 

6 

7 

8 
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10 

11 

12 

13 

14 

15 


function [C, U, R] = curFasterKernel (Xtest , Xtrain, sigma, c, r) 
pc = 2 * (r -I- c) ; % can be tuned 
pr = 2 * (r -I- c); % can be tuned 
m = size ( Xtest , 1) ; 
n = size(Xtrain, 1); 

SC = sort ( randsample (n , c)); 

SR = sort ( randsample (m, r)); 

C = rbf(Xtest , Xtrain(SC, :) , sigma) ; 

R= rbf (Xtest (SR, :), Xtrain, sigma); 

PC = sort ( randsample (m, pc)); 

PR = sort ( randsample (n , pr)); 

PC = unique ([PC; SR]) ; % enforce PC to contain SR 

PR = unique ([PR; SC]); % enforce PR to contain SC 

Kblock = rbf (Xtest (PC, :) , Xtrain(PR, :) , sigma); 

U = pinv(C(PC, :)) * Kblock * pinv(R(:, PR)); 


The time cost of this procedure is linear in m -|- n, and K* = CUR can be applied to n 
dimensional vector in 0{nr + me) time. In this way, the generalization of GPR and KPCA 
can be efficient. 
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6.7 Applications 

This section provides the implementations of kernel PCA, spectral clustering, Gaussian pro¬ 
cess regression, all sped-up by randomized algorithms. 

6.7.1 Kernel Principal Component Analysis (KPCA) 

Suppose we are given 

• n training data xi, • • • , x„ G 

• m test data x'^, • • • , x^ G M'^, (x' is not the transpose xf), 

• a kernel function k(-, •)> G.g. the RBF kernel function, 

• a target rank k (<C n, d). 

The goal of KPCA is to extract k features of each training datum and each test datum, which 
may be used in clustering or classification. The standard KPCA consists of the following 
steps: 

1. Training 

(a) Form the n x n kernel matrix K of the training data, whose the (i, j)-th entry is 

K(xi,Xj); 

(b) Compute the ^-eigenvalue decomposition = UfcAfcU^; 

(c) Form the n x k matrix UfcA].^^, whose the i-th row is the feature of x,; 

2. Generalization (test) 

(a) Form the mxn kernel matrix K* whose the (bj)-th entry is «:(x',Xj); 

_1 /o 

(b) Form the m x k matrix K^UfcA^ ' , whose the Ath row is the feature of x'. 

The most time and memory expensive step in training is the ^-eigenvalue decomposition of 
K, which can be sped-up by the sketching techniques discussed in this section. Empirically, 
the faster SPSD matrix sketching in Section 6.3 is much more accurate than the Nystrom 
method in Section 6.4, and their time and memory costs are all linear in n. Thus the faster 
SPSD matrix sketching can be better choice. KPCA can be approximately solved by several 
lines of MATLAB code. 


1 function [U, lambda, featuretrain ] = kpcaTrain (Xtrain , sigma, k) 
2S=k* 10;% can be tuned 

3 [QC, Z] = spsdFaster (Xtrain , sigma, s); % QC has orthogonal columns 

4 clear Xtrain 

5 [UZ, SZ, '] = svd(Z) ; 

6 U = QC * UZ(:, l:k); % U contains the top k eigenvectors 
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7 


8 

9 

10 


lambda = diag(SZ); 

lambda = lambda (l:k); % lambda is the vector containing the top k 
eigenvalues 

featuretrain = bsxfun (©times , U, ( sqrt (lambda) )’) ; 
end 


1 function [ featuretest ] = kpcaTest (Xtrain , Xtest , sigma, U, lambda) 

2 Ktest = rbf (Xtest, Xtrain, sigma); 

3 U = bsxfun (©times , U, (1 ./ sqrt (lambda)) ’) ; 

4 featuretest = Ktest * U; 

5 end 


In the function “kpcaTrain”, the input variable “Xtrain” has n rows, each of which corre¬ 
sponds to a training datum. The rows of the output “featuretrain” and “featuretest” are 
the features extracted by KPCA, and the features can be used to perform classihcation. For 
example, suppose each datum Xj is associated with a label yi, and let y = [yi, - ■ ■ , ynY' £ l^"'- 
We can use fc-nearest-neighbor 

1 [ytest] = knnclassify (featuretest , featuretrain , y) 

to predict the labels of the test data. 

When the number of test data m is large, the function “kpcaTest” is costly. The users 
should apply the CUR decomposition in Section 6.6.3 to speedup computation. 


1 function [featuretest] = kpcaTestCUR (Xtrain , Xtest, sigma, U, lambda) 

2 c = max(100, ceil ( size (Xtrain , 1) / 20)); % can be timed 

3 r = max(100 , ceil ( size (Xtest , 1) / 20)) ; % can be tnned 

4 [C, Utilde , R] = curFasterKernel (Xtest , Xtrain, sigma, c, r); 

5 U = bsxfun( ©times , U, (1 ./ sqrt( lambda)) ’) ; 

6 featuretest =C * (Utilde * (R * U)); 

7 end 


6.7.2 Spectral Clustering 

Spectral clustering is one of the most popular clustering algorithms. Suppose we are given 

• n data points xi, • • ■ , x„ G M'^, 

• a kernel function k{-, •), 

• k\ the number of classes. 

Spectral clustering performs the following operations: 

1. Form an n X n kernel matrix K, where big kij indicates Xj and Xj are similar; 

2. Form the degree matrix D with da = Ylj ^md dij = 0 for all i ^ j', 
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3. Compute the normalized graph Laplacian G = G 

4. Compute the top k eigenvectors of G, denote U G and normalize the rows of U; 

5. Apply fcmeans clustering on the rows of V to obtain the class labels. 

The hrst step costs 0{n‘^d) time and the fourth step costs 0{n‘^k) times, which limit the 
scalability of spectral clustering. Fowlkes et al. [11] proposed to apply the Nystrom method 
to make spectral clustering more scalable by avoiding forming the whole kernel matrix and 
speeding-up the /c-eigenvalue decomposition. Empirically, the algorithm in Section 6.3 is 
more accurate than the Nystrom method in Section 6.4, and they both runs in linear time. 
Spectral clustering with the randomized algorithm in Section 6.3 can be implemented in 16 
lines of MATLAB code. 


1 
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11 
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13 

14 

15 

16 


function [labels] = SpectralClusteringFaster (X, sigma, k) 
s = k * 10; % can be tuned 
n = size (X, 1) ; 

[QC, Z] = spsdFaster (X, sigma, s); %K is approximated by QC * Z * QC’ 
[UZ, SZ, '] = svd(Z) ; 

SZ = sqrt { diag (SZ)); 

UZ = bsxfun (©times , UZ, SZ ’) ; % now Z = UZ * UZ’ 

L = QC * UZ; % now K is approximated by L * L’ 
d = ones(n, 1) ; 

d = L * (L’ * d) ; % diagonal of the degree matrix D 
d = 1 ./ sqrt (d); 

L = bsxfun (©times , L, d) ; % now G is approximated by L*L’ 

[U, ~, ~] = svd(L, ’econ’); 

U = U(: , l:k) ; 

U = normr(U) ; % normalize the rows of U 
labels = kmeans(U, k, ’Replicates’, 3); 


When the scale of data is too large for the faster SPSD matrix sketching algorithm in 
Section 6.3, one can instead use the more efficient Nystrom method in Section 6.4: simply 
replace Lines 4 to 8 by 

1 L = Nystrom (X, sigma, s); 


6.7.3 Gaussian Process Regression (GPR) 

The Gaussian process regression (GPR) is one of the most popular machine learning methods. 
GPR is the foundation of Bayesian optimization and has important applications such as 
automatically tuning the hyper-parameters of deep neural networks. Suppose we are given 

• n training data xi, • • • , G R*^, 

• labels y = [t/i, • ■ ■ t/n]^ U K"' of the training data, 

• m test data x'^, • • ■ , x(„ G M'^, (x' is not the transpose xf), 
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• and a kernel fnnction k{-, •), e.g. the RBF kernel with kernel width parameter a. 


Training. In the training stage, GPR reqnires forming the n x n kernel matrix K where 
kij = fi:(xj,Xj) and compnting the model 

w = (K + aI„)"V- 

Here a is a tnning parameter that indicates the noise intensity in the labels y. It takes 
0{'n?d) time to form the kernel matrix and 0{'n?) time to compnte the matrix inversion. To 
make the training efficient, we can hrst sketch the SPSD matrix K to obtain K LL^ and 
then apply the techniqne in Section 6.1.2 to obtain w. Empirically, when applied to speednp 
GPR, the algorithms discnssed in Section 6.3 and Section 6.4 has similar accnracy, thns we 
choose to nse the Nystrom method which is more efficient. 

The training GPR with the Nystrom approximation can be implemented in the following 
MATLAB code. The time cost is 0{nP + nld) and the space cost is 0{nl + nd). 


1 function [w] = gprTrain (Xtrain , ytrain , sigma, alpha) 

2 1 = 100; % can be tuned 

3 L = Nystrom(Xtrain , sigma, 1); %K is approximated by L * L’ 

4 1 = size (L, 2); 

5 w = L ’ * ytrain ; 

6 w = (alpha * eye (1) + L’ * L) \ w; 

7 w = ytrain — L * w; 

8 w = w / alpha ; 

9 end 


The inpnt “sigma” is the kernel width parameter and “alpha” indicates the noise intensity 
in the observation. 

Generalization (test). After obtaining the trained model w G M”, GPR can predict the 
nnknown labels of the m test data x'^, • • ■ , G GPR forms an m x n kernel matrix K* 
whose the (qj)-th entry is fi:(x',Xj) and compute y* = K*w G M™'. The i-th entry in y* is 
the predictive label of x'. The generalization can be implemented in four lines of MATLAB 
code. 


1 function [ytest] = gprTest (Xtrain , Xtest , sigma, w) 

2 Ktest = rbf(Xtest, Xtrain, sigma); 

3 ytest = Ktest * w; 

4 end 


It costs 0{mnd) time to compute K* and 0{mn) time to apply K* to w. If m is small, 
the generalization stage can be performed straightforwardly. However, if m is as large as n, 
the time cost will be quadratic in n, and the user should apply the GUR decomposition in 
Section 6.6.3 to speedup computation. 


44 




1 function [ytest] = gprTestCUR (Xtrain , Xtest , sigma, w) 

2 c = max(100, ceil ( size (Xtrain , 1) / 20)); % can be tuned 

3 r = max(100 , ceil(size( Xtest, 1) / 20 )); % can be tuned 

4 [C, Utilde , R] = curFasterKernel (Xtest , Xtrain, sigma, c, r); 

5 ytest = C * (Utilde * (R * w)) ; 

6 end 


45 



46 



Appendix A 

Several Facts of Matrix Algebra 


This chapter lists some facts that has been applied in this paper. 

Fact A.l. The matrices Qi G and Qnxp (m > n > p) have orthonormal columns. 

Then the matrix Q = Q1Q2 has orthonormal columns. 

Fact A.2. Let A be any mxn and rankp matrix. Then AA'I’B = UaU^B = AX* = UaZ*, 
where 

X* = argmin ||B — AX||^, and Z* = argmin ||B — UaZ||^. 

X z 

This is the reason why AA'I’B and UaU^B are called the projection o/B onto the column 
space of A. 

Fact A.3. [ 34 , Lemma 44 ] The matrices Q G (m > s) has orthonormal columns. The 
solution to 

argmin ||A — QX|||. 

rank(X)</c 

is X* = (Q^A)fc, where (Q^A)^ denotes the closest rank k approximation to Q^A. 

Fact A.4. Let A'^ he the Moore-Penrose inverse of A. Then AA'I’A = A and A'I’AA'I’ = Ah 

Fact A.5. Let A he an mxn (m > n) matrix and A = QaRa be the QR decomposition of 
A. Then 



nxn nxm 


Fact A.6. Let C be a full-rank matrix with more rows than columns. Let C = QcRc 
the QR decomposition and C = UcScVc he the condensed SVD. Then the leverage scores 
ofC, Qc, Uc are the same. 
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Appendix B 

Notes and Further Reading 


The Regression Problems. Chapter 4 has applied the sketching methods to solve 
the I2 norm regression problem more efficiently. The more general ip regression problems 
have also been studied in the literature [5; 7; 17; 8]. Especially, the ii is of great interest 
because it demonstrate strong robustness to noise. Currently the strongest result is the ip 
row sampling by Lewis weights [8] . 

Distributed SVD. In the distributed model, each machine holds a subset of columns of 
A, and the system outputs the top singular values and singular vectors. In this model, the 
communication cost should also be considered, as well as the time and memory costs. The 
seminal work [10] proposed to build a coreset to capture the properties of A, which facilitates 
low computation and communication costs. Later on, several algorithms with stronger error 
bound and lower communication and computation costs have been proposed. Currently, the 
state of the art is [2]. 

Random Feature for Kernel Methods. Chapter 6 has introduced the sketching 
methods for kernel methods. A parallel line of work is the random feature methods [22] 
which also form low-rank approximations to kernel matrices. Section 6.5.3 of [27] offers simple 
and elegant proof of a random feature method. Since the sketching methods usually works 
better than the random feature methods (see the examples in [35]), the users are advised 
to apply the sketching methods introduced in Chapter 6. Besides the two kinds of low-rank 
approximation approaches, the stochastic optimization approach [9] also demonstrates very 
high scalability. 
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